# A survey on FPGA-based accelerator for ML models

Feng Yan<sup>1</sup>
University of Auckland
Auckland, New Zealand
fyan691@aucklanduni.ac.nz

Andreas Koch<sup>2</sup>
Technische Universität Darmstadt
Darmstadt, Germany
koch@esa.informatik.tu-darmstadt.de

Oliver Sinnen<sup>1</sup>
University of Auckland
Auckland, New Zealand
o.sinnen@auckland.ac.nz

Abstract—This paper surveys the acceleration of machine learning (ML) algorithms on hardware accelerators, focusing on Field-Programmable Gate Arrays (FPGAs). It reviews 287 out of 1138 papers from the past six years, sourced from four top FPGA conferences. This selection underscores the increasing integration of ML and FPGA technologies and their mutual importance in technological advancement. Research clearly emphasises inference acceleration (81%) over training acceleration (13%). Additionally, the findings reveal that CNNs dominate current FPGA acceleration research, while emerging models such as GNNs show clear growth trends. The categorization of the FPGA research papers reveals a wide range of topics, demonstrating the growing relevance of ML in FPGA research. This comprehensive analysis provides valuable insights into the current trends and future directions of FPGA research in the context of ML applications.

Index Terms—machine learning accelerator; energy efficiency; optimization strategies; machine learning; FPGA; FCCM; FPL; FPT

# I. INTRODUCTION

ML (an important subset of artificial intelligence) focuses on algorithms that learn from data to autonomously perform tasks and predict outcomes on new data without being explicitly programmed. In recent years, research on ML has shown promising results in several important domains, including image segmentation [1], object classification [2], [3] and detection [4], data classification [5], natural language processing (NLP) [6], edge computing [7], large-scale scientific computing [8], and even circuit design and optimization [9].

Moreover, ML models are growing deeper and larger to improve accuracy, although significant redundancy may exist in these often over-parameterized models [10]. These models need substantial computational resources and memory for training and inference. While Central Processing Units (CPUs) and Graphics Processing Units (GPUs) are the dominant computing devices for ML, each has shortcomings. CPUs struggle to meet high-performance demands because they are designed for general-purpose tasks with mostly sequential execution. GPUs, in contrast, are favored for their parallel processing capability in intensive ML applications. Yet this comes at a cost: implementing algorithms on GPUs often leads to substantial energy consumption, a critical drawback in energy-sensitive environments. Thus, custom architectures and development methods adapted to ML algorithms can perform better. The FPGA is a reconfigurable medium whose logic units, interconnections, processing elements and memory units can change function before or at runtime while completing a program.

Recognizing the versatile nature of FPGA as a platform for ML, this survey delves into the implementation of FPGA-based accelerators. Focusing on their application in model inference and training, this introduction aims to clarify the advantages and challenges FPGAs face in these domains. With their substantial computing resources, deployment flexibility, and high energy efficiency, FPGAs have emerged as a promising platform for implementing ML algorithms. The adaptability of FPGAs, attributable to their reconfigurable architecture, makes them particularly suitable for diverse ML applications, ranging from edge computing to large-scale data center demands.

This also raises several questions. Is the architectural design of FPGA-based accelerators predominantly oriented towards model inference rather than training? Furthermore, how do FPGAs perform in ML inference and training tasks? Their parallel processing capabilities and reconfigurable architecture are beneficial for real-time inference tasks. Nonetheless, the strict computational requirements and the need for extensive data handling during the training phase pose considerable challenges for FPGAs.

In ML, neural network models are widely used, especially in computer vision, due to their complex data processing capabilities. However, non-neural network models remain crucial in specific fields because of their simplicity. This leads to an important question: Are FPGA-based ML accelerators better suited for neural network models than non-neural ones? Neural networks, with their layered complexity, benefit from the parallel processing power of FPGAs. Non-neural network models, on the other hand, need a more specific design to match the model structure. The critical issue is how FPGAs handle the different requirements of these models, particularly in parallel processing.

This article presents a comprehensive overview, focusing on the advancements in FPGA technology showcased at the four most prominent conferences in this field over the past six years. As shown in Figure 1, the research directions can be divided into four main categories, each representing a significant area of FPGA-related studies. The pie chart visually represents the distribution of papers across these categories:

- Application and Design Studies dominates with 48% of the corpus, comprising 544 papers.
- ML follows as the second-largest category, representing 25% of the total with 288 papers.
- Architecture, CAD, and Circuit Design accounts for 15% with 170 papers.
- High-level Tools and Abstraction makes up the remaining 12% with 137 papers.

This distribution illustrates the multifaceted nature of FPGA research, which concentrates on application-oriented studies and design optimizations, accounting for nearly half of the corpus (48%). This proportion reflects the wide range of application scenarios and the continuous efforts to enhance FPGA design. Notably, ML emerges as a significant category, comprising 25% of the total papers. This considerable representation underscores the importance of ML, particularly in the context of neural network model deployments on FPGAs. The prominence of this category signifies the growing synergy between FPGA technology and advanced ML algorithms. Furthermore, as illustrated in Figure 2, there is a clear 5% increase in ML-related FPGA research starting from 2022. This trend indicates an accelerating integration of ML techniques within the FPGA domain, suggesting a pivotal development towards more complex and intelligent FPGA applications in the near future.



Fig. 1. Distribution of FPGA accelerator directions

This survey intentionally narrows its scope by recognizing the complexity and diversity of FPGA applications in the industry, often influenced by commercial considerations. Concentrating solely on academic conferences provides a focused exploration of FPGA technology's latest research and developments, thus offering a clear and academically oriented perspective on this rapidly evolving field.

# II. COMPUTING PROCESS

We start our discussion with the computing process, as it guides the choice of optimization techniques, evaluation metrics and even target platforms. This section covers four aspects of FPGA-based ML acceleration research: (A) the proportion of inference versus training papers, (B) the distribution of model types across inference and training papers, (C) trends in model inference, and (D) trends in model training.

Fig. 2. ML-related papers in the past six years

## A. Inference vs Training

This subsection analyses the distribution of papers that study inference versus training of ML on FPGAs and explores the technical motivations and application requirements behind this distribution. As shown in Figure 3, model inference occupies a dominant position, with up to 81% of the published research focusing on it. In comparison, model training accounts for only 13%, while matrix (vector) multiplication, as a basic operation, accounts for 6%. This markedly unbalanced distribution reflects the current research focus and application direction of FPGAs in ML acceleration.



Fig. 3. Computation acceleration proportion

#### 1) Reasons for dominance of Inference:

a) Low Latency Requirements: In certain application domains (e.g. real-time image recognition), the speed and response time of inference are crucial. FPGAs' customizability and parallel processing capabilities enable them to effectively meet these low-latency requirements.

The deployment of ML in various real-time applications faces severe latency and throughput challenges. In autonomous driving and video surveillance, millisecond-level latency directly affects safety [11]–[15]. At the same time, the rise of the Internet of Things (IoT) and edge computing [16]–[20] requires efficient processing of massive sensor data on resource-constrained devices. In addition, many applications need to process large-scale data streams, which places higher requirements on the real-time performance of the system [21]–[23]. To address these challenges, researchers have proposed various hardware acceleration schemes that exploit FPGAs' customizability and parallel processing capabilities.

FPGAs' customizable parallel architecture enables designers to optimize processing units for specific algorithms, thereby maximizing parallelism [15], [24]–[34]. The flexible memory hierarchy of FPGAs allows designers to optimize memory access patterns, thus reducing latency caused by data movement [30], [35]–[49]. Moreover, FPGAs support flexible data types and bit widths, facilitating an optimal balance between precision and speed [19], [50]–[57]. The dynamic reconfiguration capability of FPGAs further enhances performance by allowing systems to adjust their hardware structure in response to real-time demands [58]–[62].

These combined features make FPGAs an ideal platform for implementing low-latency ML inference. The customizable nature of FPGAs not only addresses the need for low latency but also provides a versatile solution adaptable to diverse algorithmic requirements and operational contexts.

b) Efficiency Considerations: Besides low inference time, energy efficiency has become a key design consideration in edge computing and other applications, particularly in the Internet of Things (IoT) and mobile devices. Battery-powered IoT devices demand high energy efficiency [16], [17], [19], [20]. Similarly, AI applications in mobile devices [11], [23], [63] benefit from the energy efficiency offered by FPGAs.

FPGAs demonstrate outstanding energy efficiency when executing fixed inference tasks through several key mechanisms. Firstly, customized hardware reduces unnecessary energy consumption by tailoring the architecture to specific tasks [22], [25], [64], [65]. This is complemented by the deterministic data flow during inference, which facilitates the implementation of efficient data transmission pathways [36], [46], [66]–[70].

Secondly, dynamic power management [59], [71]–[73] enables FPGAs to optimize energy usage through real-time adjustments. Lastly, low-precision computing significantly reduces energy consumption while maintaining accuracy [74]–[81].

These characteristics enable FPGA accelerators to achieve high energy efficiency and low power consumption when optimizing inference tasks.

Compared to GPUs, FPGAs exhibit superior energy efficiency in inference tasks [82], [83]. In contrast to ASICs [84]–[87], FPGAs offer a balanced approach between flexibility and energy efficiency, making them particularly suitable for evolving AI applications.

FPGAs show energy efficiency advantages in fixed inference tasks, mainly due to their unique architectural design and optimization strategy. Stream processing architecture is one of the factors for FPGAs to achieve high energy efficiency [12], [88]–[94]. This architecture allows data to flow efficiently between processing units, reducing unnecessary data movement and storage, thereby reducing energy consumption.

Memory optimization is another aspect of improving FPGA energy efficiency. Effective memory management strategies can significantly reduce energy consumption caused by data movement [43], [45], [66], [95]. By optimizing data flow and caching strategies, FPGAs can minimize external memory accesses and reduce overall power consumption. Additionally, compute-capable block RAM [35], [96] technology provides new possibilities for deep learning acceleration on FPGAs by integrating computing into storage units.

By taking full advantage of these features, designers can implement highly specialized and optimized inference accelerators on FPGAs, improving performance, energy efficiency, and resource utilization.

#### 2) Challenges and Potential of Training Acceleration Research:

a) Data Processing Complexity: Data processing requirements present several challenges for FPGA accelerators designed for AI training. These challenges can be categorized into computational demands, data management complexities and distributed training complexities.

The processing of large-scale datasets tests FPGAs' computing and storage capabilities. A primary challenge lies in the limited on-chip memory resources of FPGAs, which constrain the amount of data that can be processed simultaneously [41], [97], [98]. This limitation is compounded by bottlenecks in off-chip memory communication, which hinder data flow efficiency during AI training.

AI training also requires real-time data stream processing [12], [71], which introduces additional complexity to FPGA accelerator design. Continuous adaptation to large volumes of incoming data demands complex mechanisms for dynamic reconfiguration of FPGA resources. Moreover, maintaining low latency while processing high-throughput data streams presents a technical challenge [99]–[101]. FPGA designs must balance these immediate processing needs over the long training process, which increases the complexity of implementing AI training accelerators.

Distributed training across multiple FPGAs is one approach to dealing with the above challenges. However, data synchronization across FPGA nodes then becomes a new critical issue, as noted by research on FPGA clusters for distributed CNN training [102]–[104]. The efficient distribution of workload and data across the FPGA cluster is essential for optimal performance, yet it introduces intricate coordination problems. Minimizing communication overhead while maintaining training efficiency presents a challenge in the design of large-scale ML training systems.
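To make the synchronization step concrete, the following is a minimal sketch of synchronous data-parallel training, assuming each node computes a gradient on its own data shard and all nodes then apply the same averaged gradient. The function names and the three-node example are hypothetical and are not taken from the cited works; on a real FPGA cluster the averaging would be carried out by an all-reduce over the interconnect.

```python
from typing import List


def average_gradients(worker_grads: List[List[float]]) -> List[float]:
    """Element-wise mean of per-worker gradients (the result of an all-reduce)."""
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / n_workers for i in range(n_params)]


def sgd_step(weights: List[float], grad: List[float], lr: float) -> List[float]:
    """Plain SGD update applied identically on every node."""
    return [w - lr * g for w, g in zip(weights, grad)]


# Gradients from three hypothetical FPGA nodes for a two-parameter model.
grads = [[0.5, -1.0], [1.5, 0.0], [1.0, 1.0]]
avg = average_gradients(grads)
new_w = sgd_step([1.0, 1.0], avg, lr=0.5)
```

Every node must wait for the averaged gradient before it can start the next iteration, which is why the communication overhead discussed above directly limits training throughput.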

In response to the above-mentioned memory management and data transmission challenges during AI training, researchers have proposed sparsification and compression strategies. Both the hybrid granularity sparse training accelerator [41], [105]–[107] and the block weight compression scheme [14], [108], [109] effectively reduce the amount of data and memory requirements, thereby lightening the pressure on FPGA on-chip resources.

Meanwhile, research on static block floating point quantization [110] and dynamic quantization [111], [112] methods, respectively, explore how to reduce computational and storage overhead while maintaining model performance. These strategies not only address the memory limitation of FPGAs, but also alleviate the data transmission bottleneck, accelerating AI training on FPGAs.
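As an illustration of why block floating point reduces storage, the sketch below quantizes a block of values to small signed integer mantissas that share a single exponent. It is a generic textbook-style formulation written under our own assumptions (8-bit mantissas, one exponent per block, hypothetical function names) and does not reproduce the specific scheme of [110].

```python
import math
from typing import List, Tuple


def bfp_quantize(block: List[float], mantissa_bits: int = 8) -> Tuple[List[int], int]:
    """Quantize a block to signed integer mantissas sharing one exponent."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0] * len(block), 0
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so max_abs < 2**e.
    shared_exp = math.frexp(max_abs)[1]
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    limit = 2 ** (mantissa_bits - 1) - 1
    # Round to the nearest mantissa and clamp to the representable range.
    mantissas = [max(-limit, min(limit, round(x / scale))) for x in block]
    return mantissas, shared_exp


def bfp_dequantize(mantissas: List[int], shared_exp: int, mantissa_bits: int = 8) -> List[float]:
    """Reconstruct approximate real values from mantissas and the shared exponent."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [m * scale for m in mantissas]


block = [0.5, -0.25, 0.75, 0.125]
mantissas, shared_exp = bfp_quantize(block)
restored = bfp_dequantize(mantissas, shared_exp)
```

Only the integer mantissas and one exponent per block need to be stored and moved, so the multiply-accumulate datapath can operate on narrow integers.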

b) Algorithm Complexity: Hardware design faces challenges in implementing complex algorithms such as backpropagation.

The backpropagation algorithm presents challenges due to its complexity. The intricacy lies in gradient calculation, which involves complex mathematical operations across multiple levels of propagation [98], [113]. This multi-tiered computational structure amplifies complexity, particularly when applied to large-scale models. Furthermore, implementing stochastic gradient descent (SGD) and its variants introduces additional complexities, such as managing randomness and adaptive learning rates [12], [114].

Although techniques such as batch normalization and regularization play a key role in improving model performance, their additional processing of network parameters and activation values increases the complexity of hardware designs. Specifically, batch normalization accelerates training convergence and improves model stability by standardizing the input of each layer [106], [115]. However, its implementation requires calculating the statistics of the entire mini-batch. This global operation is difficult to parallelize efficiently on FPGAs and may become a performance bottleneck. Regularization techniques such as L1/L2 regularization [41], [110] and Dropout [116] are relatively simple in theory but require, respectively, additional weight decay when updating parameters and dynamically "turning off" some neurons during training.
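The dependence of batch normalization on whole-batch statistics can be seen in a minimal forward-pass sketch for a single feature. The function name, the scalar `gamma` and `beta`, and the value of `eps` are illustrative assumptions rather than details from the cited designs.

```python
import math
from typing import List


def batch_norm_forward(x: List[float], gamma: float = 1.0, beta: float = 0.0,
                       eps: float = 1e-5) -> List[float]:
    """Normalize one feature over a mini-batch, then scale and shift."""
    n = len(x)
    mean = sum(x) / n                          # reduction 1: needs every sample
    var = sum((v - mean) ** 2 for v in x) / n  # reduction 2: needs the mean first
    inv_std = 1.0 / math.sqrt(var + eps)
    # Only after both reductions can any single output be produced.
    return [gamma * (v - mean) * inv_std + beta for v in x]


y = batch_norm_forward([1.0, 2.0, 3.0, 4.0])
```

No output element can be emitted until both reductions have seen the full mini-batch, which is the global dependency that makes a purely streaming implementation difficult.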

To handle the challenges mentioned above, researchers have proposed a series of innovative optimization strategies. Mixed precision computing [117], [118] and compute unit optimization [107], [119] improve the efficiency of the backpropagation algorithm. Adaptive quantization [111] provides a potential solution to the irregular memory access patterns in optimization algorithms such as SGD, although it was originally proposed for inference. The performance bottleneck of batch normalization can be relieved by borrowing the concept of streaming processing [101], [120]. In addition, structured sparsification [41], [121] and hardware-aware training [122] are designed to efficiently implement regularization techniques such as Dropout. Moreover, algorithm-hardware co-design [101], [123] and new computing paradigms [124] provide a more systematic angle from which to address the challenges of complex algorithm implementation.

3) The Fundamental Role of Matrix Operations: Although matrix (vector) multiplication accounts for a small proportion of the research paper distribution (6%), as a fundamental operation of ML algorithms, it considerably impacts overall performance. Optimizing matrix operations can fundamentally improve the performance of various models.

Matrix multiplication is the core operation of deep learning and neural network calculations [125]. As the complexity of neural network models increases, matrix multiplication has become a major bottleneck for computationally intensive tasks. Through its parallel processing capabilities [37], [124], the FPGA platform realizes optimized matrix multiplication that can achieve significant performance improvements at 32-bit floating point precision.
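One common way to expose this parallelism while respecting limited on-chip memory is tiling, where each tile of the operands is loaded once and reused for many multiply-accumulate operations. The sketch below is our own illustration of that idea (tiling is an assumption here and is not attributed to the cited works); the function name and tile size are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul_tiled(a: Matrix, b: Matrix, tile: int = 2) -> Matrix:
    """Blocked matrix multiply: each (i0, j0, k0) tile models one on-chip working set."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # The inner loops are independent across (i, j) and could be
                # unrolled into parallel multiply-accumulate units in hardware.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
c = matmul_tiled(a, b)
```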

The reconfigurability [126], [127] of the FPGA platform also allows matrix multiplication to be adapted flexibly to different accuracy requirements and energy efficiency goals. This flexibility enables optimization strategies to play an important role in various ML tasks, from low-precision, energy-efficient embedded AI applications [124] to high-precision, high-performance large-scale deep learning models [105]. In particular, the advantages of FPGAs are more obvious when dealing with sparse matrices for applications such as GNNs [128], [129].
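For sparse workloads such as graph adjacency matrices, a compressed format skips the zero entries entirely. The following sketch uses the standard compressed sparse row (CSR) layout; the function names and the small three-node graph are illustrative assumptions, not an accelerator design from the cited papers.

```python
from typing import List, Tuple


def dense_to_csr(mat: List[List[float]]) -> Tuple[List[float], List[int], List[int]]:
    """Convert a dense matrix to CSR arrays (values, column indices, row pointers)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in mat:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_spmv(values: List[float], col_idx: List[int], row_ptr: List[int],
             x: List[float]) -> List[float]:
    """Sparse matrix-vector product that touches only the stored non-zeros."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for p in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[p] * x[col_idx[p]]  # irregular, data-dependent access to x
        y.append(acc)
    return y


# Adjacency matrix of a three-node path graph (hypothetical example).
adjacency = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
values, col_idx, row_ptr = dense_to_csr(adjacency)
y = csr_spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

The data-dependent indexing into `x` is the kind of irregular access for which a customized memory system can help.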

Matrix multiplication optimization on the FPGA platform improves performance and promotes hardware algorithm innovation, further impacting ML model performance. The development of new FPGA architectures facilitates more efficient implementations of matrix multiplication. This collaborative innovation is mainly reflected in two aspects: First, by designing a dedicated matrix multiplication circuit, FPGAs can achieve higher computing efficiency than a general-purpose processor [130], [131]. Second, the programmability of FPGAs allows researchers to optimize the implementation of matrix multiplication based on the structure and needs of a specific neural network [125], [129].

FPGA-based matrix operation optimization improves ML performance through bottleneck reduction, precision flexibility, and hardware-algorithm synergy. Despite limited research numbers, its impact is profound. This limited research is likely due to GPU's dominance in matrix computations, with its efficient architecture and mature ecosystem. Continued focus in this area can accelerate overall ML progress.

## B. Accelerators for different models

1) Dominance of CNN: CNNs lead the way in inference work with 122 papers, and there are also 23 papers on CNN training, showing the absolute dominance of CNNs in FPGA acceleration research. This dominance can be explained from the following perspectives:

The outstanding performance of CNNs in computer vision tasks makes them the preferred model for image recognition, object detection, and other fields. CNNs have been widely deployed in numerous real-time applications where rapid inference and swift response times are paramount. These algorithms have exhibited exceptional performance across a diverse range of tasks, including:

- 1) Image classification [23], [26], [58], [91], [132]–[136]
- 2) Object detection [11], [43], [63], [70], [137]–[141]
- 3) Robotics and Autonomous Systems [68], [70], [142]
- 4) Human Action Recognition [143], [144]
- 5) Speech and Audio Processing [110], [145], [146]



Fig. 4. Distribution of Model Types in Inference and Training Papers



Fig. 5. Surveyed paper numbers on typical ML models by year

The convolution operations in CNNs possess a high degree of regularity and parallelism, making them well-suited to the hardware characteristics of FPGAs. First, the local connection feature of CNNs: each neuron in the convolutional layer is only connected to a local area of the input data, which suits FPGA memory and computing unit structure. This local connection reduces the need for data transmission and enables data to be processed locally, thereby reducing communication overhead and improving computing efficiency [15], [147], [148].

Secondly, the weight sharing mechanism in CNNs, that is, sliding and reusing the same convolution kernel on the entire input data, is highly consistent with the reconfigurable characteristics of the lookup table (LUT) and digital signal processing (DSP) blocks in FPGAs. Weight sharing reduces the required storage resources while improving the reusability of calculations [149]–[151].

Thirdly, the convolution operation of CNNs is naturally parallel, which matches the parallel processing capability of FPGAs. Multiple convolution kernels in a CNN can process different parts of the input data simultaneously, and FPGAs can perform multiple computing tasks simultaneously through their parallel processing units [31], [65], [152], thereby greatly accelerating the entire convolution process.
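The three properties discussed above (local connection, weight sharing and parallelism) are visible in a direct single-channel convolution loop. The sketch below is a simplified illustration with a hypothetical function name; like most CNN frameworks it computes cross-correlation without padding or stride.

```python
from typing import List

Grid = List[List[int]]


def conv2d_valid(image: Grid, kernel: Grid) -> Grid:
    """Single-channel 2D convolution (cross-correlation), no padding, stride 1."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):          # every output pixel is independent,
        row = []                          # so these loops can run in parallel
        for c in range(iw - kw + 1):
            acc = 0
            for dr in range(kh):          # local connection: only a kh x kw window
                for dc in range(kw):
                    # weight sharing: the same kernel values are reused at every (r, c)
                    acc += image[r + dr][c + dc] * kernel[dr][dc]
            row.append(acc)
        out.append(row)
    return out


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
feature_map = conv2d_valid(image, kernel)
```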

Finally, the maturity of CNN model optimization techniques has further guaranteed their leading position in FPGA accelerator design. These techniques include:

- a) Quantization: Quantization enables CNNs to run efficiently on FPGA's limited-precision hardware while maintaining performance. By converting model parameters from floating-point to fixed-point numbers [21], [110], [119], [153], or even adopting binary or ternary methods [14], [154], quantization techniques [51], [57], [75], [112], [132], [155] significantly reduce the model's storage requirements and computational complexity, which is crucial for resource-constrained FPGA platforms.
- b) Pruning: Pruning strategies, by removing redundant weights and neurons, reduce the complexity and size of the model [143], [151], [156], [157], allowing more complex CNN models to be adapted to resource-limited FPGAs. This approach decreases the model's storage footprint and reduces the computational burden, thereby improving operational efficiency.
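Both techniques can be sketched in a few lines. The example below shows a simple symmetric fixed-point quantizer and unstructured magnitude pruning; the bit widths, sparsity level and function names are illustrative assumptions, and the surveyed works use a variety of more elaborate schemes.

```python
from typing import List


def quantize_fixed_point(weights: List[float], frac_bits: int = 4,
                         total_bits: int = 8) -> List[int]:
    """Map floats to signed fixed-point integers with frac_bits fractional bits."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return [max(lo, min(hi, round(w * scale))) for w in weights]


def dequantize_fixed_point(q: List[int], frac_bits: int = 4) -> List[float]:
    """Recover the real values represented by the fixed-point integers."""
    return [v / (1 << frac_bits) for v in q]


def prune_by_magnitude(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the given fraction of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


w = [0.5, -0.25, 0.1, 1.75]
q = quantize_fixed_point(w)
w_hat = dequantize_fixed_point(q)
w_pruned = prune_by_magnitude(w, sparsity=0.5)
```

In this example the only rounding error is on 0.1, which becomes 0.125; pruning at 50% sparsity removes the two smallest-magnitude weights.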

While there are numerous papers on CNN acceleration, they fundamentally follow the same basic principles in FPGA implementation, leveraging CNN's inherent features of local connection, weight sharing, and natural parallelism. The differences mainly lie in how these principles are applied to meet specific application requirements. For instance, image recognition applications demand high throughput, while object detection requires low latency and real-time processing. These varying requirements lead to different optimization strategies in quantization schemes, pruning approaches, and memory access patterns. The key to successful implementation lies in how to effectively combine and select existing solutions, adjust specific parameters, and balance resource allocation based on application-specific needs.

The widespread deployment of CNN applications and the maturity of the technology have reinforced each other, shaping the development of CNNs in the field of FPGA acceleration. To understand this evolution and explore future directions, this article analyzes the trend of accelerator design between 2018 and 2023. The analysis shows that the development during this period can be divided into three main stages: rapid growth, stable development, and continuous decline.

- c) Growing period (2018-2019): This initial stage saw rapid growth, with the number of research papers growing from 22 to 28 (an increase of 27%). This growth is attributed to three factors. First, the outstanding performance of CNNs in computer vision created demand for acceleration [23], [132], [134]; among these works, DLA [158] uses an overlay to achieve a GoogLeNet processing speed of 900 fps on Intel Arria 10. Second, the rise of edge computing created a need for low-latency and energy-efficient inference, prompting lightweight CNN accelerator architectures for edge devices [15], [57], [63], [159], [160]. Third, multi-CNN mapping and complex architecture optimization reflect researchers' pursuit of more complex and efficient CNN implementations [159], [161], [162].
- d) Stable period (2019-2020): During this period, the number of papers remained at around 28, showing the maturity of CNN acceleration technology. The research focus shifted from architecture design to optimization techniques such as quantization and sparsification [11], [14], [57], [75], [110], [154], [155], [163]. Meanwhile, researchers started application-specific optimizations for tasks such as target detection [11], [14], [15], [23], [135], [137], speech recognition [110], and image segmentation [110]. In addition, fused acceleration of CNNs with other models attracted attention, as this fusion handles more complex tasks such as time series data analysis and multi-modal learning [164].
- e) Decline period (2020-2023): Since 2020, the number of CNN papers has declined yearly, falling to about 13 in 2023, with an average annual decline of 22%. This trend reflects the following aspects. Firstly, systolic array architectures have been extensively studied and optimized [31], [165]–[168]. The design proposed in [159] achieves nearly 98% DSP utilization for the systolic array structure. This near-limit utilization indicates that CNN acceleration based on systolic arrays has reached a fairly high level of maturity. Data flow optimization techniques have been studied in depth and applied in various FPGA accelerators [21], [66], [151], [157], [168]. Memory access optimization techniques, such as data reuse and caching strategies, have also matured [21], [38], [43], [48], [51], [66], [151], [156], [157], [168].

Secondly, as these technologies have matured, CNN accelerators have achieved remarkable performance levels, with significant advances in both computational power and energy efficiency. The throughput of modern CNN accelerators has reached thousands of GFLOPS or images/s, several times greater than that of NVIDIA's V100 GPU [21], [26], [38], [51], [53], [58], [65], [66], [80], [93], [154], [163], [165], [168]–[173]. The highest computational performance recorded is 2.41 TOPS, as achieved by [154], while the record for the highest number of images processed per second stands at 4550, achieved by [65], which is four times greater than the performance of the V100 GPU. In addition to raw performance improvements, CNN accelerators have been optimized for energy efficiency. Research efforts have led to reductions in energy consumption, making accelerators far more suitable for energy-constrained environments [58], [66], [71], [84], [151], [163], [174]–[176]. A notable study [174] reports a saving of 119 millijoules per frame compared to the energy consumption of the Tesla V100 GPU.

Finally, research attention has increasingly shifted away from CNN acceleration towards emerging models like Transformer [177], [178] and GNN models [143]. Several studies evidence this shift in research focus. For instance, Auto-ViT-Acc achieved a frame rate increase of about 5.6 times on the ImageNet dataset, with only a 0.71% reduction in accuracy [178]. Similarly, Zhang et al. introduced a GNN model for Synthetic Aperture Radar (SAR) automatic target recognition (ATR). Compared to traditional CNN methods, their lightweight GNN model achieved comparable accuracy while reducing computational complexity to just 1/3258 of the original [143].

In summary, CNN research has progressed from rapid growth to maturity in FPGA acceleration. Although research enthusiasm has declined, its importance in practical applications cannot be ignored. Future research may focus more on combining CNNs with new models and on deep optimization in specific application scenarios.
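The growth and decline figures quoted for the three periods can be checked with a short calculation. The sketch assumes that the 2020 paper count equals the "around 28" reported for the stable period and that the average annual decline is a compound rate over three years; both are our assumptions, since per-year counts are only shown graphically.

```python
def percent_change(old: float, new: float) -> float:
    """Simple percentage change from old to new."""
    return (new - old) / old * 100.0


def compound_annual_rate(start: float, end: float, years: int) -> float:
    """Constant yearly rate (in percent) that turns start into end after `years`."""
    return ((end / start) ** (1.0 / years) - 1.0) * 100.0


growth_2018_2019 = percent_change(22, 28)            # growing period
decline_2020_2023 = compound_annual_rate(28, 13, 3)  # decline period, assumed start of 28
```

Under these assumptions the growth is about 27% and the compound decline is between 22% and 23% per year, consistent with the rounded figures in the text.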

2) RNNs: In ML accelerator research, RNNs are the second most popular model. There are several reasons for that:

Firstly, RNNs' status as a research hotspot is linked to their widespread applications across multiple domains. According to data from the research papers we surveyed, RNN-related publications (27 papers) are second only to CNNs (145 papers), far surpassing other models. RNNs play a crucial role in natural language processing [73], [74], [179] and time series analysis tasks, effectively handling variable-length sequence inputs and capturing temporal dependencies, which gives them significant advantages in areas like speech recognition [81] and machine translation [180]. The high demands for real-time performance and efficiency in these applications have driven research and development of FPGA accelerators, resulting in the growth of RNN-related publications.

Secondly, RNNs' computational patterns present specific implementation challenges on FPGAs. The recurrent structure of RNNs results in strict data dependencies [24], [181], [182] and irregular memory access patterns [41], [42], [73], contrasting with traditional parallel computing paradigms. These challenges have inspired researchers to explore innovative hardware architectures and acceleration strategies. Compared to the regular computational patterns of CNNs, the complexity of RNNs serves both as a limiting factor in the quantity of research and as a driving force in maintaining research interest. This computational uniqueness offers optimization space for FPGA accelerator design.

Lastly, the continuous innovation in RNN model variants also contributed to FPGA acceleration research. Advanced RNN variants such as Long Short-Term Memory (LSTM) [42], [74], [81], [83] and Gated Recurrent Units (GRU) [27], [158] effectively address the long-term dependency problems faced by traditional RNNs through the introduction of gating mechanisms. While these variants increase computational complexity, they significantly enhance the model's expressive power and range of applications. This ongoing innovation at the model level not only expands the application prospects of RNNs but also provides new research directions and optimization targets for FPGA accelerator design.
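The gating mechanism and the step-to-step dependency can both be seen in a scalar LSTM cell. The sketch below follows the standard LSTM equations with one hidden unit; the parameter dictionary, its key names and the example values are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x: float, h_prev: float, c_prev: float,
              p: Dict[str, float]) -> Tuple[float, float]:
    """One time step of a single-unit LSTM cell."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate state
    c = f * c_prev + i * g       # the cell state carries long-term information
    h = o * math.tanh(c)
    return h, c


def run_sequence(xs: List[float], p: Dict[str, float]) -> List[float]:
    """Process a sequence; step t cannot start before step t-1 has produced h and c."""
    h, c = 0.0, 0.0
    hs = []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        hs.append(h)
    return hs


keys = ["wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wg", "ug", "bg"]
params = {k: 0.5 for k in keys}  # hypothetical parameter values
hs_a = run_sequence([1.0, 0.0, 0.0], params)
hs_b = run_sequence([-1.0, 0.0, 0.0], params)
```

The four gate computations within a step are independent and can run in parallel, whereas the loop over time steps is inherently sequential, which is the data dependency discussed above.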

The important position of RNNs in neural network accelerator research stems from their broad application value, unique computational challenges, and continuous model innovation. These three aspects interact to jointly promote in-depth research on RNN-related FPGA acceleration.

Research on RNN accelerators shows a fluctuating trend overall. In the early stage (2018-2020) it was relatively stable: RNN accelerators attracted consistent interest, with several papers published each year, as RNNs continued to be explored for time-series data processing.

The research during this period mainly focused on two directions: reducing the temporal dependencies of RNNs [24], [32], [82], and improving their parallel processing capability [27], [33], [36]. At the same time, at the application level, RNN variants such as LSTM and GRU were widely used in tasks such as time-series prediction and speech recognition [42], [73], [74], [179], further promoting the development of related FPGA acceleration.

Between 2020 and 2022, there was a noticeable decline in RNN-related publications, with few articles published during this period. This trend may be due to the rise of attention mechanisms and Transformer models in traditional RNN application areas such as NLP [178], [183], which diverted research attention. At the same time, the efficiency bottleneck that RNNs face when processing long sequences also hindered further breakthroughs.

In 2023, the number of papers on RNN acceleration increased again to about 5, primarily driven by technological innovations and advancements in hardware. The DGNN-Booster [181] framework and the MSBF-LSTM [81] algorithm have opened new pathways for RNN acceleration, while bandwidth-oriented pruning strategies have effectively addressed the bandwidth bottleneck in FPGA implementations [116]. The new generation of FPGAs offers richer resources and greater flexibility, creating favorable conditions for implementing complex RNN acceleration schemes. Together, improvements in hardware resources [184] and low-precision computing techniques [81] have made running RNNs on resource-constrained FPGAs more efficient.

In the future, RNN research may focus more on integration with other models and optimized applications in specific fields.

3) GNNs: Graph Neural Network (GNN) research ranks third in FPGA acceleration research, reflecting the importance and unique properties of GNN models.

Regarding computational patterns, the operation of GNNs involves information aggregation and updates between nodes [40], [143]. This fundamentally differs from the convolution operations in CNNs and sequential processing in RNNs, presenting challenges and opportunities for FPGA implementation. The requirement for GNN models to process dynamically changing graph structures has led to specialized FPGA architectures that can adapt their data paths and memory access patterns on-the-fly [44], [181], stimulating research into innovative acceleration architectures. Concurrently, the sparsity of graph data offers potential for computational efficiency improvements, with FPGAs serving as an ideal platform for exploring sparse GNN acceleration due to their customizability [46], [49].
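As a rough illustration of the aggregate-and-update pattern described above, the following sketch implements one simplified message-passing layer in plain Python. It is not drawn from any cited accelerator: the function name `gnn_layer`, the mean aggregator, and the use of scalar weights `w_self` and `w_neigh` in place of weight matrices are assumptions chosen to keep the example short.

```python
def gnn_layer(features, neighbors, w_self, w_neigh):
    """One simplified message-passing layer: mean aggregation, then ReLU update.

    features:  list of per-node feature vectors
    neighbors: neighbors[v] is the list of node indices adjacent to v
    """
    out = []
    for v, h_v in enumerate(features):
        nbrs = neighbors[v]
        # Aggregate: the amount of work depends on the node degree, which is
        # the source of the irregular memory accesses and load imbalance.
        if nbrs:
            agg = [sum(features[u][k] for u in nbrs) / len(nbrs)
                   for k in range(len(h_v))]
        else:
            agg = [0.0] * len(h_v)
        # Update: combine the node's own features with the aggregated message.
        out.append([max(0.0, w_self * s + w_neigh * a)
                    for s, a in zip(h_v, agg)])
    return out
```

For a three-node graph with `features = [[1.0], [2.0], [3.0]]` and `neighbors = [[1, 2], [0], []]`, node 0 gathers two feature vectors, node 1 gathers one, and node 2 gathers none, so the per-node workload already differs in this tiny case.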

Additionally, from the perspective of model evolution and application expansion, the GNN domain is characterized by rapid algorithmic innovation, exemplified by the emergence of Graph Attention Networks (GAT) and Graph Isomorphism Networks (GIN) [44], [97], [131], [185]. This progression has in turn driven FPGA acceleration research. The cross-domain application prospects of GNNs in areas such as recommendation systems, drug discovery, and traffic prediction have motivated researchers to explore versatile and efficient FPGA acceleration solutions. Furthermore, integrating GNNs with other models, as in spatio-temporal GNNs [181], has introduced new research directions in FPGA accelerator design.

Through customizable memory access paths, FPGA-based GNN accelerators can efficiently process irregular graph data structures while achieving a balanced distribution of computing tasks [40], [46], [49], [143], which has promoted the rapid development of related research. This advantage is directly reflected in the growing research interest of recent years. According to the charts and data, the number of FPGA accelerator studies for GNN models shows a continuous upward trend, increasing from 1 paper in 2020 to 10 papers in 2023.

The development and application scope of GNN models have also expanded. In recent years, GNNs have shown strong performance in traffic prediction [97] and dynamic graph analysis [181], which has increased the demand for GNN acceleration [97], [181], [185]–[189].

The size and complexity of graph data are also increasing. With the advent of big data, the demand for processing large-scale graph data has surged. Traditional platforms, such as a single CPU or GPU, often cannot cope effectively: training a graph ML model may take hours or even days [97]. In addition, many GNNs cannot be reduced to simple matrix multiplications, and processing these complex and irregular data structures requires specialized graph preprocessing and model computation modes [189]. In this context, FPGAs provide an effective GNN acceleration solution: their customizable datapaths can adapt to the irregular access patterns of graph data, thereby achieving efficient training and inference [46], [181], [190].

FPGAs perform well not only in static graph processing but also in dynamic graph updating and inference. For example, a mini-batch algorithm that operates on subgraphs can reduce the communication between the CPU and the FPGA and improve training efficiency [190]. In addition, purpose-built FPGA accelerator frameworks can handle dynamic GNN inference and graph updates, exploiting the strengths of FPGAs in processing irregular and dynamic data structures [44], [181].

Compared with CNN's regular matrix operations, GNN's graph-structured computations require dynamic memory access patterns and irregular data flows, which can be efficiently implemented through FPGA's customizable data paths and memory hierarchies.

Improvements in FPGA hardware and advances in hardware-software co-design allow researchers to optimize data layout and design computing pipelines for GNNs, fully utilizing the computational power and memory bandwidth of the new generation of FPGA chips. For example, H-GCN proposed a hybrid accelerator based on the Xilinx Versal ACAP architecture, which divides the graph into different subgraphs through software-hardware collaboration, processing them using programmable logic (PL) and AI engines (AIE) respectively [190]. SDMA accelerates sparse-dense matrix multiplication through three hardware optimization strategies: equal-value partitioning, vertex clustering optimization and adaptive on-chip data flow scheduling [46]. SkeletonGCN improves the DSP utilization of FPGAs and enhances the training efficiency of GCN by introducing software-hardware collaborative optimization methods, including quantization, simplification of nonlinear operations, and intermediate result reuse [97]. Furthermore, the DGNN-Booster framework uses two distinct data flow designs to optimize dynamic graph network inference performance using high-level synthesis (HLS) technology [181].
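The sparse-dense matrix multiplication kernel mentioned above can be sketched in a few lines of Python. This is a generic reference version over the standard compressed sparse row (CSR) layout, not SDMA's architecture or any of its three optimization strategies; the function name `spmm_csr` and its argument names are assumptions for this example.

```python
def spmm_csr(indptr, indices, data, dense):
    """Multiply a CSR sparse matrix by a dense matrix (list of rows).

    indptr[r]:indptr[r + 1] delimits the non-zeros of sparse row r;
    indices holds their column ids and data holds their values.
    """
    n_rows = len(indptr) - 1
    n_cols = len(dense[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        # Only stored non-zeros are visited, so work scales with sparsity.
        for p in range(indptr[r], indptr[r + 1]):
            col, val = indices[p], data[p]
            row = dense[col]  # data-dependent, irregular access into the dense operand
            for j in range(n_cols):
                out[r][j] += val * row[j]
    return out
```

In a GNN setting the sparse operand would typically be the adjacency matrix and the dense operand the node-feature matrix. The data-dependent read of `dense[col]` is the access pattern that customized memory paths aim to serve efficiently.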

In conclusion, the growing complexity of graph data and demand for GNN acceleration have led to advances in FPGA-based solutions. By leveraging FPGA's parallelism and flexibility, researchers have optimized GNN training and inference, especially for irregular graph structures. As FPGA technology evolves, it will continue to enhance GNN performance and broaden its applications.

4) Attention networks: Compared with more mature NN models, the attention mechanism and its representative architecture, the Transformer, mark an important breakthrough in ML. The attention mechanism has unique advantages: whereas a CNN relies on fixed feature extraction and an RNN processes information sequentially, attention can directly establish associations between any two positions in the input sequence, realizing true global information interaction [183], [191]. This design not only improves the expressiveness of the model but also makes parallel computation possible.

The Transformer architecture achieved an initial breakthrough in NLP tasks through the self-attention mechanism. The TRAC framework has shown that the architecture overcomes the limitations of traditional sequence models and provides better long-range dependency modeling capabilities while maintaining parallel processing efficiency [183], [191]. This success has promoted the expansion of the attention mechanism to computer vision. ViT broke through the limitation of the CNN's fixed receptive field by introducing an adaptive attention mechanism [177]. An optimized ViT achieved a 5.6-fold performance improvement on the ImageNet classification task with only a 0.71% decrease in accuracy [178], verifying the potential of the attention mechanism in scenarios that require global information understanding [192].

However, the computational nature of the attention mechanism poses challenges. Its core operations involve a large number of matrix multiplications, and the computational complexity grows quadratically with the sequence length [183]. At the same time, the dynamic calculation of attention weights requires frequent memory accesses and complex non-linear operations [191]. These computational challenges have driven researchers to explore solutions on FPGA platforms. Although the number of studies is still relatively small (rising from 1 to 5 per year between 2020 and 2023), this growth trend highlights the potential of FPGAs for Transformer networks.
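To make the quadratic growth tangible, the sketch below implements standard scaled dot-product attention in plain Python. It is a textbook formulation rather than the dataflow of any cited accelerator; the names `attention` and `score_count` are assumptions introduced for illustration.

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention on nested lists (one head, no masking)."""
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in Q:
        # One score per key: n scores for each of the n queries, n * n in total.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        # Softmax is the non-linear step; subtracting the max keeps exp() stable.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def score_count(n):
    """Number of query-key scores for a sequence of length n."""
    return n * n
```

Doubling the sequence length quadruples `score_count`, and every score passes through the exponential in the softmax, which is why both the matrix multiplications and the non-linear operations become optimization targets.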

The breakthrough success of the Transformer model in NLP has triggered demand for hardware acceleration. In the early stages of research, the focus was on optimizing nonlinear computations. Li et al. [53] proposed a low-cost reconfigurable nonlinear core that supports a variety of nonlinear operations based on input range reduction and polynomial approximation. The core can accelerate the calculation of different nonlinear layers by configuring the contents of the lookup tables (LUTs) and the data path. Feng et al. [193] ensured high accuracy for nonlinear activation functions (NAFs) through a non-uniform piece-wise linear approximation method, and designed flexible data paths and shared hardware resources, reducing the use of lookup tables and DSPs and further improving the efficiency of hardware resource utilization.
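The idea behind a non-uniform piece-wise linear approximation can be sketched as follows for the sigmoid function. This is a minimal software illustration, not the design of Li et al. or Feng et al.: the breakpoints in `BREAKPOINTS` are arbitrary assumptions (denser where the curve bends most), and the names `build_table` and `sigmoid_pwl` are hypothetical. The symmetry of the sigmoid serves as a simple form of input range reduction, so only non-negative inputs need table entries.

```python
import bisect
import math

# Assumed breakpoints: short segments near 0, longer ones where sigmoid flattens.
BREAKPOINTS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 8.0]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def build_table(f, bps):
    """Precompute (slope, intercept) per segment, as a small lookup table would."""
    table = []
    for x0, x1 in zip(bps, bps[1:]):
        slope = (f(x1) - f(x0)) / (x1 - x0)
        table.append((slope, f(x0) - slope * x0))
    return table

TABLE = build_table(sigmoid, BREAKPOINTS)

def sigmoid_pwl(x):
    ax = abs(x)  # range reduction: sigmoid(-x) = 1 - sigmoid(x)
    if ax >= BREAKPOINTS[-1]:
        y = 1.0  # saturate outside the tabulated range
    else:
        seg = bisect.bisect_right(BREAKPOINTS, ax) - 1
        slope, intercept = TABLE[seg]
        y = slope * ax + intercept  # one multiply and one add per evaluation
    return y if x >= 0 else 1.0 - y
```

Each evaluation needs only a segment lookup, one multiplication, and one addition, which is the property that makes this family of approximations attractive for LUT- and DSP-constrained designs. Choosing more or differently placed breakpoints trades table size against accuracy.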

FPGAs' configurable units can adapt to different Transformer architectures, as shown by LTrans-OPU's non-linear layer design and TRAC's matrix operation optimization. In LTrans-OPU [191], a reconfigurable non-linear core provides low-cost, high-efficiency acceleration of the non-linear layers. Likewise, TRAC [183] supports different model architectures through compilation optimizations, adapting to varying matrix operation requirements. The core matrix operations of the attention mechanism benefit from a dual-array design on the FPGA: Calabash [192] achieved a 2.3x speedup in self-attention computation through optimized matrix multiplication pipelines.

In addition, optimization techniques such as precision flexibility and memory storage optimization further enhance FPGA performance in attention networks. The authors of TRAC [183] investigated the application of weight compression techniques, reporting a 12-fold reduction in LUT usage and a 2-fold reduction in DSP usage. Token Packing introduced an optimized memory subsystem design to efficiently manage complex data streams. These optimization techniques have further propelled the development of attention mechanisms on FPGAs.

The rapid growth of attention mechanism research reflects the FPGA community's sensitivity and adaptability to emerging AI models. As the Transformer and its variants are applied in more fields, we can expect this research direction to remain active, and it may surpass traditional RNNs in research popularity within the next few years.

5) Other ML models: Although neural network models dominate, traditional ML models still have a place in FPGA acceleration research. From the perspective of model types, current research mainly covers the following types of traditional ML models:

- Distance-based models, such as K-nearest neighbor (k-NN) and K-means clustering (K-means) algorithms
- Probabilistic models, represented by Bayesian networks
- Decision tree models, including ensemble learning methods such as random forests and XGBoost
- Reinforcement learning models, especially in hardware optimization applications

The continued existence of traditional ML models in FPGA acceleration research is due to the advantages of these models in specific application scenarios and the good match of FPGA architecture to their computing characteristics. Taking K-nearest neighbours (k-NN) and K-means as examples, these algorithms are still widely used in image retrieval, cluster analysis and other fields [63], [194]–[196]. The KPynq system [197] and other k-NN implementations mentioned in the literature demonstrate performance improvements of FPGAs in accelerating such algorithms. For example, the K-means accelerator proposed by Hu et al. utilizes the most significant digit first (MSDF) arithmetic, combined with the parallel processing capabilities of FPGAs, to achieve efficient distance calculation and comparison operations [196].

These implementations not only increase the execution speed of the algorithms but also reduce energy consumption, allowing traditional algorithms to remain competitive in large-scale data processing and real-time applications. In addition, the k-NN FPGA implementation based on online arithmetic proposed by Gorgin et al. [198] achieves a speed increase of up to 34% over the best existing design by utilizing digit-level pipelines and dynamic termination of unnecessary calculations. At the same time, the method proposed by Kim et al. [195], which uses computational storage devices to accelerate large-scale neighbour searches, demonstrates the potential of combining traditional algorithms with new hardware architectures, providing an efficient and energy-saving solution for data-centre-level applications.
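The benefit of terminating unnecessary calculations early can be illustrated with a software analogue. The sketch below is a k-NN search that abandons a candidate as soon as its partial squared distance can no longer beat the current k-th best. It is not the digit-serial online-arithmetic design of Gorgin et al., which applies a similar idea at the level of individual digits; the name `knn_early_exit` and the pruning counter are assumptions for this example.

```python
import heapq

def knn_early_exit(query, points, k):
    """Return the k nearest points as sorted (squared_distance, index) pairs,
    plus the number of candidates rejected by the distance bound."""
    heap = []  # max-heap emulated with negated distances: (-dist, index)
    pruned = 0
    for idx, p in enumerate(points):
        bound = -heap[0][0] if len(heap) == k else float("inf")
        dist = 0.0
        rejected = False
        for a, b in zip(query, p):
            dist += (a - b) ** 2
            if dist >= bound:  # cannot enter the top k, stop accumulating
                rejected = True
                break
        if rejected:
            pruned += 1
            continue
        if len(heap) == k:
            heapq.heapreplace(heap, (-dist, idx))  # evict the current worst
        else:
            heapq.heappush(heap, (-dist, idx))
    return sorted((-d, i) for d, i in heap), pruned
```

Because squared differences are non-negative, the partial sum only grows, so rejecting a candidate once it reaches the bound never changes the result; it only skips the remaining dimensions.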

The presence of some models, such as Bayesian networks and decision trees, in FPGA acceleration research reflects the need for interpretability. Research on the CausaLearn framework [88] and other Bayesian network accelerators has shown that FPGAs can effectively handle complex tasks such as probabilistic reasoning and structure learning [25], [63], [199], [200]. These implementations improve the performance of Bayesian models in real-time data analysis and large-scale inference tasks by leveraging the reconfigurability and parallel processing capabilities of FPGAs.

At the same time, implementations of decision tree models (such as random forests and XGBoost) on FPGAs have also demonstrated speedups, keeping these models practical in application scenarios that require fast decision-making [12], [141], [201]. Of particular note is the FPGA accelerator for Bayesian network structure learning proposed by Nitta and Takase [200], which achieves efficient parallel processing under limited resources by iteratively using processing elements. In a practical 37-node network structure learning task, this method achieves an 8.6-fold speedup over software execution. In addition, the logarithmic number system arithmetic for sum-product network (SPN) inference proposed by Weber et al. [202] maintains sufficient accuracy while saving up to 50% of hardware resources, demonstrating the potential of FPGAs in optimizing complex probabilistic models. These studies show that FPGAs can accelerate traditional Bayesian and decision tree models, provide new implementation methods for these models, and expand their application scope.
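Why logarithmic arithmetic suits sum-product networks can be seen from a small sketch that evaluates SPN nodes in the log domain. It uses floating-point natural logarithms in Python, whereas a hardware logarithmic number system would typically use fixed-point logarithms and approximate the addition step, so this is an analogy rather than the method of Weber et al.; the function names are assumptions, and all weights and probabilities are assumed to be strictly positive.

```python
import math

def log_product(log_children):
    """Product node: multiplying probabilities becomes adding their logs."""
    return sum(log_children)

def log_weighted_sum(log_children, weights):
    """Sum node: log(sum_i w_i * p_i), computed stably with log-sum-exp."""
    terms = [math.log(w) + lc for w, lc in zip(weights, log_children)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Tiny example: a root sum node over two product nodes with two leaves each.
leaves = [math.log(p) for p in (0.2, 0.5, 0.4, 0.1)]
left = log_product(leaves[:2])    # log(0.2 * 0.5)
right = log_product(leaves[2:])   # log(0.4 * 0.1)
log_root = log_weighted_sum([left, right], [0.3, 0.7])
```

Products, which dominate SPN evaluation, reduce to additions, and very small probabilities stay representable as moderate negative numbers. The cost moves to the sum nodes, where the addition requires a non-linear function.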

The flexibility and efficiency of FPGAs in accelerating traditional ML algorithms provide the possibility for hybrid models and new algorithm implementations. For example, studies such as FPNet explored the use of reinforcement learning to automatically design CNN architectures suitable for FPGAs, demonstrating the combination of traditional optimization techniques and deep learning [63]. In addition, some studies have also explored the combination of traditional algorithms (such as k-NN) with new storage technologies, such as computational storage devices, to solve the bandwidth bottleneck problem in data-intensive applications [195].

These innovative directions show that FPGAs can not only accelerate individual traditional algorithms but also support more complex hybrid models and new computing paradigms, providing a broad research space for the future development of ML. It is worth mentioning that the N3H-Core proposed by Gong et al. shows how to use heterogeneous computing cores and a reinforcement learning (RL) algorithm to optimize NN accelerators [19]. This method makes full use of DSP and LUT resources for efficient DL inference. At the same time, the runtime tuning scheme based on Bayesian optimization proposed by Zhu et al. [203] provides flexible configuration capabilities for DNN accelerators under dynamic workloads.

The evolution of these acceleration approaches has shaped the landscape of traditional ML acceleration on FPGAs over the past several years. This technological progression is reflected in the changing patterns of research focus and publication trends.

During the early exploration and application phase (2018-2020), the number of studies fluctuated between 3 and 7, with a primary focus on Bayesian [25], [88], K-means [197], and RL [63] algorithms.

In the 2020-2021 period, the number of studies declined to between 2 and 4 per year. The rise of Transformer and GNN models led to a shift in research focus, which also transitioned from the hardware optimization of earlier stages [25], [63] to algorithm-level optimization [182], [200]. This shift suggests that hardware optimization alone had reached a plateau, prompting researchers to pursue the co-optimization of algorithms and hardware. Simultaneously, innovative trends in model fusion began to emerge. For instance, Gao et al. [182] proposed a Bayesian LSTM accelerator, marking the first demonstration of the potential for combining traditional probabilistic methods with deep learning.

During the revival and integration phase (2021-2023), the number of studies peaked at 9 in 2022, then declined to 5 in 2023. Decision tree models regained attention due to their interpretability. For example, the FPDeep [204] framework integrated a decision tree acceleration module. Furthermore, hybrid models saw further development, with the combination of reinforcement learning and deep learning opening new research directions. For instance, the TD3lite [205] framework implemented efficient deep reinforcement learning on FPGAs, while the BoostGCN framework [40] combined gradient boosting trees and GNNs to achieve efficient graph data processing on FPGAs.

Traditional ML models have demonstrated sustained vitality in FPGA acceleration research. These models continue to play an essential role in FPGA acceleration through integration with emerging technologies and optimization for specific application scenarios.

### C. Trends and Outlook

This section looks at the trends in FPGA accelerator research for different ML models from 2018 to 2023, covering the categories discussed above, namely CNNs, GNNs, attention-based networks, RNNs, and traditional ML models. By analyzing these trends, we can learn about the evolution of FPGA technology in ML, its current status, and future directions. The trends discussed in this section reflect the flexibility and potential of FPGA technology in adapting to the needs of different ML models. We can foresee that the future development of FPGAs in ML acceleration will focus more on model fusion, optimization for specific application scenarios, and hardware-software co-design. For researchers, this means seeking innovation in interdisciplinary fields and closely integrating algorithm optimization with hardware design. In addition, with the development of new FPGA architectures and the continuous advancement of optimization technology, we believe that FPGAs will play an increasingly important role in the future AI hardware ecosystem, providing strong support for efficient and flexible ML implementations.

# III. REFERENCES


- [1] B. A. Skourt, A. El Hassani, and A. Majda, "Lung ct image segmentation using deep neural networks," *Procedia Computer Science*, vol. 127, pp. 109–113, 2018.
- [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," *Advances in neural information processing systems*, vol. 25, 2012.
- [3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in *Proceedings of the IEEE conference on computer vision and pattern recognition*, 2016, pp. 770–778.
- [4] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "Yolov4: Optimal speed and accuracy of object detection," arXiv preprint arXiv:2004.10934, 2020.
- [5] R. Saravanan and P. Sujatha, "A state of art techniques on machine learning algorithms: a perspective of supervised learning approaches in data classification," in 2018 Second international conference on intelligent computing and control systems (ICICCS). IEEE, 2018, pp. 945–949.
- [6] S. B. Goldberg, N. Flemotomos, V. R. Martinez, M. J. Tanana, P. B. Kuo, B. T. Pace, J. L. Villatte, P. G. Georgiou, J. Van Epps, Z. E. Imel *et al.*, "Machine learning and natural language processing in psychotherapy research: Alliance as example use case." *Journal of counseling psychology*, vol. 67, no. 4, p. 438, 2020.
- [7] M. S. Murshed, J. J. Carroll, N. Khan, and F. Hussain, "Resource-aware on-device deep learning for supermarket hazard detection," in 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 2020, pp. 871–876.
- [8] E. Haghighat and R. Juanes, "Sciann: A keras/tensorflow wrapper for scientific computations and physics-informed deep learning using artificial neural networks," *Computer Methods in Applied Mechanics and Engineering*, vol. 373, p. 113552, 2021.
- [9] K. Settaluri, A. Haj-Ali, Q. Huang, K. Hakhamaneshi, and B. Nikolic, "Autockt: Deep reinforcement learning of analog circuit designs," in 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2020, pp. 490–495.
- [10] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," Advances in neural information processing systems, vol. 28, 2015.
- [11] H. Fan, S. Liu, M. Ferianc, H.-C. Ng, Z. Que, S. Liu, X. Niu, and W. Luk, "A real-time object detection accelerator with compressed ssdlite on fpga," in 2018 International conference on field-programmable technology (FPT). IEEE, 2018, pp. 14–21.
- [12] W. d Jiang and F. Mao, "Accelerated real-time classification of evolving data streams using adaptive random forests," in 2023 International Conference on Field Programmable Technology (ICFPT). IEEE, 2023.
- [13] S. Liu and W. Luk, "Towards an efficient accelerator for dnn-based remote sensing image segmentation on fpgas," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2019, pp. 187–193.
- [14] C. Ding, S. Wang, N. Liu, K. Xu, Y. Wang, and Y. Liang, "Req-yolo: A resource-aware, efficient quantization framework for object detection on fpgas," in proceedings of the 2019 ACM/SIGDA international symposium on field-programmable gate arrays, 2019, pp. 33–42.
- [15] H. Nakahara, H. Yonekawa, T. Fujii, and S. Sato, "A lightweight yolov2: A binarized cnn with a parallel support vector regression for an fpga," in *Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2018, pp. 31–40.
- [16] K. Vipin, "Zynet: automating deep neural network implementation on low-cost reconfigurable edge computing platforms," in 2019 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2019, pp. 323–326.
- [17] P. Mousouliotis, I. Papaefstathiou, and L. Petrou, "Squeezejet-3: an accelerator utilizing fpga mpsocs for edge cnn applications," in 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2020, pp. 236–236.
- [18] Y. Liu, S. Rai, S. Ullah, and A. Kumar, "Netpu: Prototyping a generic reconfigurable neural network accelerator architecture," in 2022 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2022, pp. 1–1.
- [19] Y. Gong, Z. Xu, Z. He, W. Zhang, X. Tu, X. Liang, and L. Jiang, "N3h-core: Neuron-designed neural network accelerator via fpga-based heterogeneous computing cores," in *Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2022, pp. 112–122.
- [20] N. P. Ghanathe, V. Seshadri, R. Sharma, S. Wilton, and A. Kumar, "Mafia: Machine learning acceleration on fpgas for iot applications," in 2021 31st International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2021, pp. 347–354.
- [21] C. Latotzke, T. Ciesielski, and T. Gemmeke, "Design of high-throughput mixed-precision cnn accelerators on fpga," in 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2022, pp. 358–365.
- [22] P. Colangelo, N. Nasiri, E. Nurvitadhi, A. Mishra, M. Margala, and K. Nealis, "Exploration of low numeric precision deep learning inference using intel® fpgas," in 2018 IEEE 26th annual international symposium on field-programmable custom computing machines (FCCM). IEEE, 2018, pp. 73–80.
- [23] D. Wu, Y. Zhang, X. Jia, L. Tian, T. Li, L. Sui, D. Xie, and Y. Shan, "A high-performance cnn processor based on fpga for mobilenets," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2019, pp. 136–143.
- [24] Z. Que, H. Nakahara, H. Fan, J. Meng, K. H. Tsoi, X. Niu, E. Nurvitadhi, and W. Luk, "A reconfigurable multithreaded accelerator for recurrent neural networks," in 2020 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2020, pp. 20–28.
- [25] G. G. Ko, Y. Chai, R. A. Rutenbar, D. Brooks, and G.-Y. Wei, "Accelerating bayesian inference on structured graphs using parallel gibbs sampling," in 2019 29th international conference on field programmable logic and applications (fpl). IEEE, 2019, pp. 159–165.
- [26] R. Kuramochi and H. Nakahara, "An fpga-based low-latency accelerator for randomly wired neural networks," in 2020 30th International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2020, pp. 298–303.
- [27] H. Liu, A. Panahi, D. Andrews, and A. Nelson, "An fpga-based upper-limb rehabilitation device for gesture recognition and motion evaluation using multi-task recurrent neural networks," in 2020 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2020, pp. 3605–3615.
- [28] C. Huang, X. Dong, Z. Li, T. Song, Z. Liu, and L. Dong, "Efficient stride 2 winograd convolution method using unified transformation matrices on fpga," in 2021 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2021, pp. 1–9.
- [29] P. Bhowmik, J. H. Pantho, J. M. Mbongue, and C. Bobda, "Esca: Event-based split-cnn architecture with data-level parallelism on ultrascale+fpga," in 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2021, pp. 176–180.
- [30] X. Di, H. Yang, Z. Huang, N. Mao, Y. Jia, and Y. Zheng, "Exploring resource-efficient acceleration algorithm for transposed convolution of gans on fpga," in 2019 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2019, pp. 19–27.
- [31] L. Liu and S. Brown, "Leveraging fine-grained structured sparsity for cnn inference on systolic array architectures," in 2021 31st International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2021, pp. 301–305.
- [32] Z. Que, H. Nakahara, E. Nurvitadhi, H. Fan, C. Zeng, J. Meng, X. Niu, and W. Luk, "Optimizing reconfigurable recurrent neural networks," in 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2020, pp. 10–18.
- [33] Y. Gong, B. Liu, W. Ge, and L. Shi, "Rna: Reconfigurable lstm accelerator with near data approximate processing," in 2019 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2019, pp. 311–314.
- [34] V. Rybalkin and N. Wehn, "When massive GPU parallelism ain't enough: A novel hardware architecture of 2d-lstm neural network," in FPGA '20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, February 23-25, 2020, 2020, pp. 111–121.
- [35] Y. Chen and M. S. Abdelfattah, "Bramac: Compute-in-bram architectures for multiply-accumulate on fpgas," in 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2023.
- [36] D. Diamantopoulos and C. Hagleitner, "A system-level transprecision fpga accelerator for blstm using on-chip memory reshaping," in 2018 International Conference on Field-Programmable Technology (FPT). IEEE, 2018, pp. 338–341.
- [37] W. Zhang, M. Jiang, and G. Luo, "Evaluating low-memory gemms for convolutional neural network inference on fpgas," in 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2020, pp. 28–32.
- [38] N. Zhang, G. Wang, J. Wang, H. Chen, W. Liu, and L. Chen, "All adder neural networks for on-board remote sensing scene classification," *IEEE Transactions on Geoscience and Remote Sensing*, 2023.

- [39] A. Yang, Y. Li, H. Shu, J. Deng, C. Ma, Z. Li, and Q. Wang, "An opencl-based fpga accelerator for compressed yolov2," in 2019 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2019, pp. 235–238.
- [40] B. Zhang, R. Kannan, and V. Prasanna, "Boostgcn: A framework for optimizing gcn inference on fpga," in 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2021, pp. 29–39.
- [41] S. Wang, Z. Li, C. Ding, B. Yuan, Q. Qiu, Y. Wang, and Y. Liang, "C-lstm: Enabling efficient lstm using structured compression techniques on fpgas," in *Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2018, pp. 11–20.
- [42] S. Cao, C. Zhang, Z. Yao, W. Xiao, L. Nie, D. Zhan, Y. Liu, M. Wu, and L. Zhang, "Efficient and effective sparse lstm on fpga with bank-balanced sparsity," in *Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2019, pp. 63–72.
- [43] J. Meng, S. K. Venkataramanaiah, C. Zhou, P. Hansen, P. Whatmough, and J.-s. Seo, "Fixyfpga: Efficient fpga accelerator for deep neural networks with high element-wise sparsity and without external memory access," in 2021 31st International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2021, pp. 9–16.
- [44] Q. Wang, L. Zheng, Y. Huang, P. Yao, C. Gui, X. Liao, H. Jin, W. Jiang, and F. Mao, "Grasu: A fast graph update library for fpga-based dynamic graph processing," in *The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2021, pp. 149–159.
- [45] S. Liu, C. Zeng, H. Fan, H.-C. Ng, J. Meng, Z. Que, X. Niu, and W. Luk, "Memory-efficient architecture for accelerating generative networks on fpga," in 2018 International Conference on Field-Programmable Technology (FPT). IEEE, 2018, pp. 30–37.
- [46] Y. Gao, L. Gong, C. Wang, T. Wang, and X. Zhou, "Sdma: An efficient and flexible sparse-dense matrix-multiplication architecture for gnns," in 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2022, pp. 307–312.
- [47] C. Fu, S. Zhu, H. Chen, F. Koushanfar, H. Su, and J. Zhao, "Simbnn: A similarity-aware binarized neural network acceleration framework," in 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2019, pp. 319–319.
- [48] S. Panchapakesan, Z. Fang, and J. Li, "Syncnn: Evaluating and accelerating spiking neural networks on fpgas," in 2021 31st International Conference on Field-Programmable Logic and Applications (FPL), 2021, pp. 286–293.
- [49] S. Xu, W. Huang, and Y. Huang, "Tfr-gen: A gen accelerator with tile-fusing strategy," in 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2022, pp. 1–1.
- [50] M. Yingchang and Q. Liu, "M4bram: Mixed-precision matrix-matrix multiplication in fpga block rams," in 2023 International Conference on Field Programmable Technology (ICFPT). IEEE, 2023.
- [51] C. Wu, J. Zhuang, K. Wang, and L. He, "Mp-opu: A mixed precision fpga-based overlay processor for convolutional neural networks," in 2021 31st International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2021, pp. 33–37.
- [52] X. Yu, Y. Wang, J. Miao, E. Wu, H. Zhang, Y. Meng, B. Zhang, B. Min, D. Chen, and J. Gao, "A data-center fpga acceleration platform for convolutional neural networks," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2019, pp. 151–158.
- [53] Y. Li, W. Cao, X. Zhou, and L. Wang, "A low-cost reconfigurable nonlinear core for embedded dnn applications," in 2020 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2020, pp. 35–38.
- [54] S. Yan, Z. Liu, Y. Wang, C. Zeng, Q. Liu, B. Cheng, and R. C. Cheung, "An fpga-based mobilenet accelerator considering network structure characteristics," in 2021 31st International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2021, pp. 17–23.
- [55] M. Sun, Z. Li, A. Lu, Y. Li, S.-E. Chang, X. Ma, X. Lin, and Z. Fang, "Film-qnn: Efficient fpga acceleration of deep neural networks with intra-layer, mixed-precision quantization," in *Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2022, pp. 134–145.

- [56] J. Wu, J. Zhou, Y. Gao, Y. Ding, N. Wong, and H. K.-H. So, "Msd: Mixing signed digit representations for hardware-efficient dnn acceleration on fpga with heterogeneous resources," in 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2023, pp. 94–104.
- [57] M. P. Véstias, R. P. Duarte, J. T. de Sousa, and H. Neto, "Hybrid dot-product calculation for convolutional neural networks in fpga," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2019, pp. 350–353.
- [58] H. Irmak, D. Ziener, and N. Alachiotis, "Increasing flexibility of fpga-based cnn accelerators with dynamic partial reconfiguration," in 2021 31st International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2021, pp. 306–311.
- [59] Y. Yu, T. Zhao, K. Wang, and L. He, "Light-opu: An fpga-based overlay processor for lightweight convolutional neural networks," in Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2020, pp. 122–132.
- [60] M. Hardieck, M. Kumm, K. Möller, and P. Zipf, "Reconfigurable convolutional kernels for neural networks on fpgas," in *Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2019, pp. 43–52.
- [61] H. G. M. Hernandez, "Towards the efficient multi-platform execution of deep neural networks," in 2021 31st International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2021, pp. 277–278.
- [62] S. I. Venieris, J. Fernandez-Marques, and N. D. Lane, "unzipfpga: Enhancing fpga-based cnn engines with on-the-fly weights generation," in 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2021, pp. 165–175.
- [63] Y. Yang, C. Wang, L. Gong, and X. Zhou, "Fpnet: Customized convolutional neural network for fpga platforms," in 2019 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2019, pp. 399–402.
- [64] M. Sun, S. Lin, S. Liu, S. Li, Y. Wang, W. Jiang, and W. Wang, "Hardware-friendly acceleration for deep neural networks with microstructured compression," in 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2022, pp. 1–1.
- [65] M. Hall and V. Betz, "From tensorflow graphs to luts and wires: Automated sparse and physically aware cnn hardware generation," in 2020 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2020, pp. 56–65.
- [66] L. Petrica, T. Alonso, M. Kroes, N. Fraser, S. Cotofana, and M. Blott, "Memory-efficient dataflow inference for deep cnns on fpga," in 2020 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2020, pp. 48–55.
- [67] B. Biggs, C.-S. Bouganis, and G. Constantinides, "Atheena: A toolflow for hardware early-exit network automation," in 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2023, pp. 121–132.
- [68] Z. Xu, J. Yu, C. Yu, H. Shen, Y. Wang, and H. Yang, "Cnn-based feature-point extraction for real-time visual slam on embedded fpga," in 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2020, pp. 33–37.
- [69] Y. Meng, S. Kuppannagari, R. Kannan, and V. Prasanna, "Dynamap: Dynamic algorithm mapping framework for low latency cnn inference," in *The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2021, pp. 183–193.
- [70] I. Mohamed, P. Toupas, Z. Yu, and C.-S. Bouganis, "Extending data flow architectures for convolutional neural networks to multiple fpgas," in 2023 International Conference on Field Programmable Technology (ICFPT). IEEE, 2023.
- [71] D. Piyasena, S.-K. Lam, and M. Wu, "Accelerating continual learning on edge fpga," in 2021 31st International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2021, pp. 294–300.
- [72] J. Chen, S. Hong, W. He, J. Moon, and S.-W. Jun, "Eciton: Very low-power lstm neural network accelerator for predictive maintenance at the edge," in 2021 31st International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2021, pp. 1–8.
- [73] C. Gao, D. Neil, E. Ceolini, S.-C. Liu, and T. Delbruck, "Deltarnn: A power-efficient recurrent neural network accelerator," in *Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2018, pp. 21–30.
- [74] Y. Zheng, H. Yang, Z. Huang, T. Li, and Y. Jia, "A high energy-efficiency fpga-based lstm accelerator architecture design by structured pruning and normalized linear quantization," in 2019 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2019, pp. 271–274.
- [75] R. Rajat, H. Zeng, and V. Prasanna, "A flexible design automation tool for accelerating quantized spectral cnns," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2019, pp. 144–150.
- [76] Y. Dai, S. Liu, Y. Lu, H. Zhou, S. Rasoulinezhad, P. H. Leong, and L. Wang, "Apir-dsp: An approximate pir-dsp architecture for error-tolerant applications," in 2021 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2021, pp. 1–8.
- [77] J. Faraone, G. Gambardella, D. Boland, N. Fraser, M. Blott, and P. H. Leong, "Customizing low-precision deep neural networks for fpgas," in 2018 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2018, pp. 97–973.
- [78] J. Sommer, M. A. Özkan, O. Keszocze, and J. Teich, "Dsp-packing: Squeezing low-precision arithmetic into fpga dsp blocks," in 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2022, pp. 160–166.
- [79] A. Boutros, S. Yazdanshenas, and V. Betz, "Embracing diversity: Enhanced dsp blocks for low-precision deep learning on fpgas," in 2018 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2018, pp. 35–357.
- [80] W. Zhenyu, S. ÖMo, and H. Kwok-Hayürgen, "Ssimd: Supporting six signed multiplications in a dsp block for low-precision cnn on fpgas," in 2023 International Conference on Field Programmable Technology (ICFPT). IEEE, 2023.
- [81] S. Bian, H. Li, C. Wang, C. Song, and Y. Tang, "Msbf-lstm: Most-significant bit-first lstm accelerators with energy efficiency optimisations," in 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2023, pp. 218–218.
- [82] A. Boutros, E. Nurvitadhi, R. Ma, S. Gribok, Z. Zhao, J. C. Hoe, V. Betz, and M. Langhammer, "Beyond peak performance: Comparing the real performance of ai-optimized fpgas and gpus," in 2020 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2020, pp. 10–19.
- [83] V. Rybalkin, J. Ney, M. K. Tekleyohannes, and N. Wehn, "When massive gpu parallelism ain't enough: A novel hardware architecture of 2d-lstm neural network," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 15, no. 1, pp. 1–35, 2021.
- [84] V. Leon, K. Pekmestzi, and D. Soudris, "Exploiting the potential of approximate arithmetic in dsp & ai hardware accelerators," in 2021 31st International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2021, pp. 263–264.
- [85] E. Nurvitadhi, J. J. Cook, A. K. Mishra, D. Marr, K. Nealis, P. Colangelo, A. C. Ling, D. Capalija, U. Aydonat, S. Y. Shumarayev et al., "In-package domain-specific asics for intel® stratix® 10 fpgas: A case study of accelerating deep learning using tensortile asic (abstract only)," in Proceedings of the ACM/SIGDA International Symposium on Field-programmable Gate Arrays, 2018.
- [86] A. Samajdar, T. Garg, T. Krishna, and N. Kapre, "Scaling the cascades: Interconnect-aware fpga implementation of machine learning problems," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2019, pp. 342–349.
- [87] E. Nurvitadhi, D. Kwon, A. Jafari, A. Boutros, J. Sim, P. Tomson, H. Sumbul, G. Chen, P. Knag, R. Kumar et al., "Why compete when you can work together: Fpga-asic integration for persistent rnns," in 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2019, pp. 199–207.
- [88] B. Darvish Rouhani, M. Ghasemzadeh, and F. Koushanfar, "Causalearn: Automated framework for scalable streaming-based causal bayesian learning using fpgas," in *Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2018, pp. 1–10.
- [89] L. Ioannou and S. A. Fahmy, "Lightweight programmable dsp block overlay for streaming neural network acceleration," in 2019 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2019, pp. 355–358.

- [90] A. Montgomerie-Corcoran, Z. Yu, J. Cheng, and C.-S. Bouganis, "Pass: Exploiting post-activation sparsity in streaming architectures for cnn acceleration," in 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2023, pp. 288–293.
- [91] D. Piyasena, R. Wickramasinghe, D. Paul, S.-K. Lam, and M. Wu, "Reducing dynamic power in streaming cnn hardware accelerators by exploiting computational redundancies," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2019, pp. 354–359.
- [92] A. Khodamoradi, K. Denolf, and R. Kastner, "S2n2: A fpga accelerator for streaming spiking neural networks," in *The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2021, pp. 194–205.
- [93] A. Montgomerie-Corcoran, Z. Yu, and C.-S. Bouganis, "Samo: Optimised mapping of convolutional neural networks to streaming architectures," in 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2022, pp. 418–424.
- [94] A. Montgomerie-Corcoran, P. Toupas, Z. Yu, and C.-S. Bouganis, "Satay: a streaming architecture toolflow for accelerating yolo models on fpga devices," in 2023 International Conference on Field Programmable Technology (ICFPT). IEEE, 2023.
- [95] A. Jinguji, S. Sato, and H. Nakahara, "Tiny on-chip memory realization of weight sparseness split-cnns on low-end fpgas," in 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2020, pp. 229–229.
- [96] X. Wang, V. Goyal, J. Yu, V. Bertacco, A. Boutros, E. Nurvitadhi, C. Augustine, R. Iyer, and R. Das, "Compute-capable block rams for efficient deep learning acceleration on fpgas," in 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2021, pp. 88–96.
- [97] Y.-C. Lin and V. Prasanna, "A framework for graph machine learning on heterogeneous architecture," in 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2023, pp. 245–246.
- [98] S. K. Venkataramanaiah, Y. Ma, S. Yin, E. Nurvitadhi, A. Dasu, Y. Cao, and J.-s. Seo, "Automatic compiler based fpga accelerator for cnn training," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2019, pp. 166–172.
- [99] Y.-C. Lin, B. Zhang, and V. Prasanna, "Hp-gnn: Generating high throughput gnn training implementation on cpu-fpga heterogeneous platform," in *Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2022, pp. 123–133.
- [100] Y. Umuroglu, Y. Akhauri, N. J. Fraser, and M. Blott, "Logicnets: Codesigned neural networks and circuits for extreme-throughput applications," in 2020 30th International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2020, pp. 291–297.
- [101] Z. Yu and C.-S. Bouganis, "Streamsvd: Low-rank approximation and streaming accelerator co-design," in 2021 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2021, pp. 1–9.
- [102] H. Nakahara, Y. Sada, M. Shimoda, K. Sayama, A. Jinguji, and S. Sato, "Fpga-based training accelerator utilizing sparseness of convolutional neural network," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2019, pp. 180–186.
- [103] P. Kreowsky, J. Knapheide, and B. Stabernack, "Challenges using fpga clusters for distributed cnn training," in 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2023, pp. 347–348.
- [104] J. Knapheide, P. Kreowsky, and B. Stabernack, "Demonstrating nada: A workflow for distributed cnn training on fpga clusters," in 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2023, pp. 363–363.
- [105] M. Imani, S. Salamat, B. Khaleghi, M. Samragh, F. Koushanfar, and T. Rosing, "Sparsehd: Algorithm-hardware co-optimization for efficient high-dimensional computing," in 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2019, pp. 190–198.
- [106] M. Isakov, A. Ehret, and M. Kinsy, "Closnets: Batchless dnn training with on-chip a priori sparse neural topologies," in 2018 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2018, pp. 55–554.

- [107] M. Yingchang and Q. Liu, "An fpga-based mix-grained sparse training accelerator," in 2023 International Conference on Field Programmable Technology (ICFPT). IEEE, 2023.
- [108] ——, "Squeezeblock: A transparent weight compression scheme for deep neural networks," in 2023 International Conference on Field Programmable Technology (ICFPT). IEEE, 2023.
- [109] J. Yang, J. Kim, and J.-Y. Kim, "Learninggroup: A real-time sparse training on fpga via learnable weight grouping for multi-agent reinforcement learning," in 2022 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2022, pp. 1–9.
- [110] H. Fan, G. Wang, M. Ferianc, X. Niu, and W. Luk, "Static block floating-point quantization for convolutional neural networks on fpga," in 2019 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2019, pp. 28–35.
- [111] Y. Wang, Q. Liu, and S. Yan, "Dqi: A dynamic quantization method for efficient convolutional neural network inference accelerators," in 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2022, pp. 1–1.
- [112] R. Abra, D. Denisenko, R. Allen, T. Vanderhoek, S. Wolstencroft, and M. Gibson, "Low precision networks for efficient inference on fpgas," in 2021 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2021, pp. 1–5.
- [113] T. Sadasue and T. Isshiki, "Scalable full hardware logic architecture for gradient boosted tree training," in 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2020, pp. 234–234.
- [114] C. Song, Y. Tang, J. Liu, S. Bian, D. Deng, and H. Li, "Msdf-sgd: Most-significant digit-first stochastic gradient descent for arbitrary-precision training," in 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2023, pp. 159–165.
- [115] C. Luo, M.-K. Sit, H. Fan, S. Liu, W. Luk, and C. Guo, "Towards efficient deep neural network training by fpga-based batch-level parallelism," *Journal of Semiconductors*, vol. 41, no. 2, p. 022403, 2020.
- [116] S. Li, S. Zhu, X. Luo, T. Luo, and W. Liu, "An efficient sparse lstm accelerator on embedded fpgas with bandwidth-oriented pruning," in 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2023, pp. 42–48.
- [117] S. Fox, J. Faraone, D. Boland, K. Vissers, and P. H. Leong, "Training deep neural networks in low-precision with high accuracy using fpgas," in 2019 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2019, pp. 1–9.
- [118] D. H. Noronha, Z. Que, W. Luk, and S. J. Wilton, "Flexible instrumentation for live on-chip debug of machine learning training on fpgas," in 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2021, pp. 20–28.
- [119] M. Tatsumi, S.-I. Filip, C. White, O. Sentieys, and G. Lemieux, "Mixing low-precision formats in multiply-accumulate units for dnn training," in 2022 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2022, pp. 1–9.
- [120] D. Piyasena, S.-K. Lam, and M. Wu, "Edge accelerator for lifelong deep learning using streaming linear discriminant analysis," in 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2021, pp. 259–259.
- [121] E. Wang, J. J. Davis, G.-I. Stavrou, P. Y. Cheung, G. A. Constantinides, and M. Abdelfattah, "Logic shrinkage: Learned fpga netlist sparsity for efficient neural network inference," in *Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2022, pp. 101–111.
- [122] F. Jentzsch, "Hardware-aware automl for exploration of custom fpga accelerators for radioml," in 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2023, pp. 359–360.
- [123] H. Chen and C. Hao, "Hardware/software co-design for machine learning accelerators," in 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2023, pp. 233–235.
- [124] S. Salamat, M. Imani, B. Khaleghi, and T. Rosing, "F5-hd: Fast flexible fpga-based framework for refreshing hyperdimensional computing," in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2019, pp. 53–62.
- [125] D. J. Moss, S. Krishnan, E. Nurvitadhi, P. Ratuszniak, C. Johnson, J. Sim, A. Mishra, D. Marr, S. Subhaschandra, and P. H. Leong, "A customizable matrix multiplication framework for the intel harpv2 xeon+ fpga platform: A deep learning case study," in *Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2018, pp. 107–116.
- [126] L. Rasnayake and M. Själander, "Improving memory access locality for vectorized bit-serial matrix multiplication in reconfigurable computing," in 2019 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2019, pp. 415–418.
- [127] Y. Umuroglu, L. Rasnayake, and M. Själander, "Bismo: A scalable bit-serial matrix multiplication overlay for reconfigurable computing," in 2018 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2018, pp. 307–3077.
- [128] J. Oliver, C. Álvarez, T. Cervero, X. Martorell, J. D. Davis, and E. Ayguadé, "Accelerating spmv on fpgas through block-row compress: a task-based approach," in 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2023, pp. 151–158.
- [129] A. K. Jain, C. Ravishankar, H. Omidian, S. Kumar, M. Kulkarni, A. Tripathi, and D. Gaitonde, "Modular and lean architecture with elasticity for sparse matrix vector multiplication on fpgas," in 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2023, pp. 133–143.
- [130] J. Zhuang, J. Lau, H. Ye, Z. Yang, Y. Du, J. Lo, K. Denolf, S. Neuendorffer, A. Jones, J. Hu et al., "Charm: Composing heterogeneous accelerators for matrix multiply on versal acap architecture," in Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2023, pp. 153–164.
- [131] E. Taka, A. Arora, K.-C. Wu, and D. Marculescu, "Maxeva: Maximizing the efficiency of matrix multiplication on versal ai engine," arXiv preprint arXiv:2311.04980, 2023.
- [132] Y. Yang, Q. Huang, B. Wu, T. Zhang, L. Ma, G. Gambardella, M. Blott, L. Lavagno, K. Vissers, J. Wawrzynek et al., "Synetgy: Algorithm-hardware co-design for convnet accelerators on embedded fpgas," in Proceedings of the 2019 ACM/SIGDA international symposium on field-programmable gate arrays, 2019, pp. 23–32.
- [133] S. Panchapakesan, Z. Fang, and N. Chandrachoodan, "Easpinn: Effective automated spiking neural network evaluation on fpga," in 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2020, pp. 242–242.
- [134] Y. Zhao, X. Gao, X. Guo, J. Liu, E. Wang, R. Mullins, P. Y. Cheung, G. Constantinides, and C.-Z. Xu, "Automatic generation of multi-precision multi-arithmetic cnn accelerators for fpgas," in 2019 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2019, pp. 45–53.
- [135] D. Danopoulos, C. Kachris, and D. Soudris, "Automatic generation of fpga kernels from open format cnn models," in 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2020, pp. 237–237.
- [136] Y. Zhao, Y. Xia, R. Loureiro, H. Zhao, U. Dolinsky, and S. Yang, "Fpl demo: A learning-based motion artefact detector for heterogeneous platforms," in 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2023, pp. 366–366.
- [137] H.-J. Kang, "Real-time object detection on 640x480 image with vgg16+ ssd," in 2019 International conference on field-programmable technology (ICFPT). IEEE, 2019, pp. 419–422.
- [138] T. Zhao, Y. Yu, K. Wang, and L. He, "Heterogeneous dual-core overlay processor for light-weight cnns," in 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2021, pp. 264–264.
- [139] A. Sohrabizadeh, J. Wang, and J. Cong, "End-to-end optimization of deep learning applications," in *Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2020, pp. 133–139.
- [140] S. Fang, L. Tian, J. Wang, S. Liang, D. Xie, Z. Chen, L. Sui, Q. Yu, X. Sun, Y. Shan et al., "Real-time object detection and semantic segmentation hardware system with deep learning networks," in 2018 International conference on field-programmable technology (FPT). IEEE, 2018, pp. 389–392.
- [141] N. R. Miniskar, A. Young, F. Liu, W. Blokland, A. Cabrera, and J. S. Vetter, "Ultra low latency machine learning for scientific edge applications," in 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2022, pp. 01–07.

- [142] T. Zhang, D. Li, H. Wang, Y. Li, X. Ma, W. Luo, Y. Wang, Y. Huang, Y. Li, Y. Zhang et al., "A-u3d: A unified 2d/3d cnn accelerator on the versal platform for disparity estimation," in 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2022, pp. 123–129.
- [143] B. Zhang, R. Kannan, V. Prasanna, and C. Busart, "Accurate, low-latency, efficient sar automatic target recognition on fpga," in 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2022, pp. 1–8.
- [144] H. Fan, H.-C. Ng, S. Liu, Z. Que, X. Niu, and W. Luk, "Reconfigurable acceleration of 3d-cnns for human action recognition with block floating-point representation," in 2018 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2018, pp. 287–2877.
- [145] M. Jiao, Y. Li, P. Dang, W. Cao, and L. Wang, "A high performance fpga-based accelerator design for end-to-end speaker recognition system," in 2019 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2019, pp. 215–223.
- [146] S. Tridgell, D. Boland, P. H. Leong, and S. Siddhartha, "Real-time automatic modulation classification," in 2019 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2019, pp. 299–302.
- [147] N. Soga and H. Nakahara, "Design method for an lut network-based cnn with a sparse local convolution," in 2020 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2020, pp. 294–295.
- [148] J. Mu, W. Zhang, H. Liang, and S. Sinha, "A collaborative framework for fpga-based cnn design modeling and optimization," in 2018 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2018, pp. 139–1397.
- [149] X. Jia, Y. Zhang, G. Liu, X. Yang, T. Zhang, J. Zheng, D. Xu, H. Wang, R. Zheng, S. Pareek et al., "Xvdpu: A high performance cnn accelerator on the versal platform powered by the ai engine," in 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2022, pp. 01–09.
- [150] M. Véstias, R. P. Duarte, J. T. de Sousa, and H. Neto, "Lite-cnn: A high-performance architecture to execute cnns in low density fpgas," in 2018 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2018, pp. 399–3993.
- [151] N. Li, L. Liu, S. Wei, and S. Yin, "A high-performance inference accelerator exploiting patterned sparsity in cnns," in 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2020, pp. 243–243.
- [152] G. Zibo, P. Toupas, Z. Yu, and C.-S. Bouganis, "Efficient fpga-based accelerator for post-processing in object detection," in 2023 International Conference on Field Programmable Technology (ICFPT). IEEE, 2023.
- [153] A. Maclellan, L. McLaughlin, L. Crockett, and R. Stewart, "Fpga accelerated deep learning radio modulation classification using matlab system objects & pynq," in 2019 29th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2019, pp. 246–247.
- [154] H. Nakahara, Z. Que, and W. Luk, "High-throughput convolutional neural network on an fpga by customized jpeg compression," in 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2020, pp. 1–9.
- [155] Y. Cao, C. Wang, and Y. Tang, "Explore efficient lut-based architecture for quantized convolutional neural networks on fpga," in 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2020, pp. 232–232.
- [156] T. Yang, Y. Liao, J. Shi, Y. Liang, N. Jing, and L. Jiang, "A winograd-based cnn accelerator with a fine-grained regular sparsity pattern," in 2020 30th International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2020, pp. 254–261.
- [157] Y. Niu, R. Kannan, A. Srivastava, and V. Prasanna, "Reuse kernels or activations? a flexible dataflow for low-latency spectral cnn acceleration," in *Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2020, pp. 266–276.
- [158] M. S. Abdelfattah, D. Han, A. Bitar, R. DiCecco, S. O'Connell, N. Shanker, J. Chu, I. Prins, J. Fender, A. C. Ling et al., "Dla: Compiler and fpga overlay for neural network inference acceleration," in 2018 28th international conference on field programmable logic and applications (FPL). IEEE, 2018, pp. 411–4117.

- [159] H. Zeng, R. Chen, C. Zhang, and V. Prasanna, "A framework for generating high throughput cnn implementations on fpgas," in *Proceedings of the 2018 ACM/SIGDA international symposium on field-programmable gate arrays*, 2018, pp. 117–126.
- [160] S. Zhao, F. An, and H. Yu, "A 307-fps 351.7-gops/w deep learning fpga accelerator for real-time scene text recognition," in 2019 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2019, pp. 263–266.
- [161] S. I. Venieris and C.-S. Bouganis, "f-cnnx: A toolflow for mapping multi-cnn applications on fpgas," in 2018 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2018, pp. 381–388.
- [162] L. C. Chan, G. Malik, and N. Kapre, "Partitioning fpga-optimized systolic arrays for fun and profit," in 2019 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2019, pp. 144–152.
- [163] J. Knapheide, B. Stabernack, and M. Kuhnke, "A high throughput mobilenetv2 fpga implementation based on a flexible architecture for depthwise separable convolution," in 2020 30th International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2020, pp. 277–283.
- [164] Y. Sun, B. Liu, and X. Xu, "An opencl-based hybrid cnn-rnn inference accelerator on fpga," in 2019 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2019, pp. 283–286.
- [165] A. Dua, Y. Li, and F. Ren, "Systolic-cnn: an opencl-defined scalable run-time-flexible fpga accelerator architecture for accelerating convolutional neural network inference in cloud/edge computing," in 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2020, pp. 231–231.
- [166] C. Jiang, D. Ojika, B. Patel, and H. Lam, "Optimized fpga-based deep learning accelerator for sparse cnn using high bandwidth memory," in 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2021, pp. 157–164.
- [167] S. Basalama, A. Sohrabizadeh, J. Wang, and J. Cong, "A versatile systolic array for transposed and dilated convolution on fpga," in 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2022, pp. 1–2.
- [168] P. Xue, L. Pan, L. Sun, and M. Huang, "Dual-line-systolic array for high performance cnn accelerator," in 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2022, pp. 1–1.
- [169] M. Stan, M. Hall, M. Ibrahim, and V. Betz, "Hpipe nx: Boosting cnn inference acceleration performance with ai-optimized fpgas," in 2022 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2022, pp. 1–9.
- [170] A. Anupreetham, M. Ibrahim, M. Hall, A. Boutros, A. Kuzhively, A. Mohanty, E. Nurvitadhi, V. Betz, Y. Cao, and J.-s. Seo, "End-to-end fpga-based object detection using pipelined cnn and non-maximum suppression," in 2021 31st International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2021, pp. 76–82.
- [171] A. Maarouf, N. El Droubi, R. Morcel, H. Hajj, M. A. Saghir, and H. Akkary, "Optimized distribution of an accelerated convolutional neural network across multiple fpgas," in 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2020, pp. 235–235.
- [172] Y. Meng, H. Men, and V. Prasanna, "Accelerating deformable convolution networks," in 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2022, pp. 1–1.
- [173] P. Toupas, C.-S. Bouganis, and D. Tzovaras, "fpgahart: A toolflow for throughput-oriented acceleration of 3d cnns for har onto fpgas," in 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2023.
- [174] F. Steinert, J. Knapheide, and B. Stabernack, "Demonstration of a distributed accelerator framework for energy-efficient ml processing," in 2021 31st International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2021, pp. 386–386.
- [175] F. Yu, Y. Cao, and Y. Tang, "Realization of quantized neural network for super-resolution on pynq," in 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2020, pp. 233–233.

- [176] H. Deng, J. Wang, H. Ye, S. Xiao, X. Meng, and Z. Yu, "3d-vnpu: a flexible accelerator for 2d/3d cnns on fpga," in 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2021, pp. 181–185.
- [177] A. Khataei, G. Singh, and K. Bazargan, "Approximate hybrid binary-unary computing with applications in bert language model and image processing," in *Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays*, 2023, pp. 165–175.
- [178] Z. Li, M. Sun, A. Lu, H. Ma, G. Yuan, Y. Xie, H. Tang, Y. Li, M. Leeser, Z. Wang et al., "Auto-vit-acc: An fpga-aware automatic acceleration framework for vision transformer with mixed-scheme quantization," in 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2022, pp. 109–116.
- [179] S. Ribes, P. Trancoso, I. Sourdis, and C.-S. Bouganis, "Mapping multiple lstm models on fpgas," in 2020 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2020, pp. 1–9.
- [180] E. Nurvitadhi, A. Boutros, P. Budhkar, A. Jafari, D. Kwon, D. Sheffield, A. Prabhakaran, K. Gururaj, P. Appana, and M. Naik, "Scalable low-latency persistent neural machine translation on cpu server with multiple fpgas," in 2019 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2019, pp. 307–310.
- [181] H. Chen and C. Hao, "Dgnn-booster: A generic fpga accelerator framework for dynamic graph neural network inference," arXiv preprint arXiv:2304.06831, 2023.
- [182] M. Ferianc, Z. Que, H. Fan, W. Luk, and M. Rodrigues, "Optimizing bayesian recurrent neural networks on an fpga-based accelerator," in 2021 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2021, pp. 1–10.
- [183] P. Plagwitz, F. Hannig, and J. Teich, "Trac: Compilation-based design of transformer accelerators for fpgas," in 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2022, pp. 17–23.
- [184] W. Zhenyu, S. ÖMo, and H. Kwok-Hayürgen, "Lutnet-rc: Look-up tables networks for reservoir computing on an fpga," in 2023 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2023.
- [185] W. d Jiang and F. Mao, "Graft: Gnn-based adaptive framework for efficient cgra mapping," in 2023 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2023.
- [186] C. Liu, H. Liu, L. Zheng, Y. Huang, X. Ye, X. Liao, and H. Jin, "Fnng: A high-performance fpga-based accelerator for k-nearest neighbor graph construction," in *Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays*, 2023, pp. 67–77.
- [187] S.-Y. Huang, Y.-C. Yang, Y.-R. Su, B.-C. Lai, J. Duarte, S. Hauck, S.-C. Hsu, J.-X. Hu, and M. S. Neubauer, "Low latency edge classification gnn for particle trajectory tracking on fpgas," in 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2023, pp. 294–298.
- [188] H. Chen, A. Zakeri, F. Wen, H. E. Barkam, and M. Imani, "Hypergraf: Hyperdimensional graph-based reasoning acceleration on fpga," in 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2023, pp. 34–41.
- [189] S. Abi-Karam and C. Hao, "Gnnbuilder: An automated framework for generic graph neural network accelerator generation, simulation, and optimization," in 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2023, pp. 212–218.
- [190] H. Zeng and V. Prasanna, "Graphact: Accelerating gcn training on cpu-fpga heterogeneous platforms," in *Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2020, pp. 255–265.
- [191] Y. Bai, H. Zhou, K. Zhao, M. Zhang, J. Chen, J. Yu, and K. Wang, "Ltrans-opu: A low-latency fpga-based overlay processor for transformer networks," in 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2023, pp. 283–287.
- [192] Z. Luo, L. Lu, Y. Jin, L. Jia, and Y. Liang, "Calabash: Accelerating attention using a systolic array chain on fpgas," in 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2023, pp. 242–247.
- [193] X. Feng, Y. Li, Y. Qian, J. Gao, W. Cao, and L. Wang, "A high-precision flexible symmetry-aware architecture for element-wise activation functions," in 2021 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2021, pp. 1–4.
- [194] L. Kljucaric and A. D. George, "Clustering classification on fpgas for neuromorphic feature extraction," in 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2023, pp. 223–223.
- [195] J.-H. Kim, Y.-R. Park, J. Do, S.-Y. Ji, and J.-Y. Kim, "Accelerating large-scale nearest neighbor search with computational storage device," in 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2021, pp. 254–254.
- [196] S. Gorgin, M. Gholamrezaei, D. Javaheri, and J.-A. Lee, "An energy-efficient k-means clustering fpga accelerator via most-significant digit first arithmetic," in 2022 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2022, pp. 1–4.
- [197] Y. Wang, Z. Zeng, B. Feng, L. Deng, and Y. Ding, "Kpynq: A work-efficient triangle-inequality based k-means on fpga," in 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2019, pp. 320–320.
- [198] S. Gorgin, M. H. Gholamrezaei, D. Javaheri, and J.-A. Lee, "An efficient fpga implementation of k-nearest neighbors via online arithmetic," in 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2022, pp. 1–2.
- [199] R. Miyagi, R. Yasudo, K. Sano, and H. Takase, "Elastic sample filter: An fpga-based accelerator for bayesian network structure learning," FPT 2022, vol. 5, p. 310, 2022.
- [200] Y. Nitta and H. Takase, "An fpga accelerator for bayesian network structure learning with iterative use of processing elements," in 2020 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2020, pp. 29–34.
- [201] A. Gajjar, P. Kashyap, A. Aysu, P. Franzon, S. Dey, and C. Cheng, "Faxid: Fpga-accelerated xgboost inference for data centers using hls," in 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2022, pp. 1–9.
- [202] L. Weber, L. Sommer, J. Oppermann, A. Molina, K. Kersting, and A. Koch, "Resource-efficient logarithmic number scale arithmetic for spn inference on fpgas," in 2019 International Conference on Field-Programmable Technology (ICFPT). IEEE, 2019, pp. 251–254.
- [203] X. Zhu, C. Gao, S. Saha, X. Zhai, and K. D. McDonald-Maier, "Bayesian optimization for efficient heterogeneous mpsoc based dnn accelerator runtime tuning," in 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2023, pp. 355–356.
- [204] T. Geng, T. Wang, A. Sanaullah, C. Yang, R. Patel, and M. Herbordt, "A framework for acceleration of cnn training on deeply-pipelined fpga clusters with work and weight load balancing," in 2018 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2018, pp. 394–3944.
- [205] C.-W. Hu, J. Hu, and S. P. Khatri, "Td3lite: Fpga acceleration of reinforcement learning with structural and representation optimizations," in 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2022, pp. 79–85.